Combining Words and Compound Terms for Monolingual and Cross-Language Information Retrieval
Authors
Abstract
Most existing information retrieval (IR) systems use single words as indexing units to represent the contents of documents and queries. One consequence is a low recall level. In this paper, we propose integrating compound terms as additional indexing units, because terms are more precise units of representation than words. Terms are recognized using a terminology database and an automatic term extraction tool based on syntactic templates and statistical analysis. We first show that the use of compound terms is greatly beneficial to monolingual IR. Compound terms are then incorporated into statistical translation models trained on a large set of parallel texts. Our experiments on cross-language information retrieval (CLIR) show that such a translation model leads to much better CLIR effectiveness when compound terms are integrated.
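As a rough illustration of using compound terms as additional indexing units, the sketch below indexes a text by its single words and, on top of that, by any word sequence found in a small term list. It is not the authors' system: the `TERMINOLOGY` set, the `index_units` function, and the underscore-joined term format are all hypothetical, and the lookup stands in only for the terminology-database step, not for the syntactic templates or statistical analysis mentioned in the abstract.

```python
from collections import Counter

# Hypothetical terminology database: known compound terms,
# each stored as a tuple of lowercase words.
TERMINOLOGY = {
    ("information", "retrieval"),
    ("statistical", "translation", "model"),
}
MAX_TERM_LEN = max(len(term) for term in TERMINOLOGY)


def index_units(text, terminology=TERMINOLOGY, max_len=MAX_TERM_LEN):
    """Count index units: every single word, plus every contiguous
    word sequence that appears in the terminology."""
    words = text.lower().split()
    units = Counter(words)  # single-word units are always kept
    for start in range(len(words)):
        for length in range(2, max_len + 1):
            candidate = tuple(words[start:start + length])
            # Near the end of the text the slice may be shorter than `length`.
            if len(candidate) == length and candidate in terminology:
                # Join with "_" so the compound term is one indexing unit.
                units["_".join(candidate)] += 1
    return units


units = index_units("Statistical translation model for information retrieval")
print(units)
```

Keeping the single words alongside the compound terms means a query that only mentions "retrieval" can still match, while a query containing the full term gains an extra, more specific unit to match on.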
Similar resources
Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration
Cross-language information retrieval (CLIR), where queries and documents are in different languages, has recently become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system in which we combine query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our sy...
Combination Approaches in Information Retrieval: Words vs. N-grams and Query Translation vs. Document Translation
This paper reports our proposal and experimental results at the NTCIR-4 CLIR task. For monolingual information retrieval, we use a combination strategy that integrates words and n-grams at the ranked list level. In combining words and n-grams, we concentrate on generating several ranked lists showing different retrieval characteristics on word and n-gram indexes by incorporating feedback scheme...
A Language-Independent Approach to European Text Retrieval
We present an approach to multilingual information retrieval that does not depend on the existence of specific linguistic resources such as stemmers or thesauri. Using the HAIRCUT system we participated in the monolingual, bilingual, and multilingual tasks of the CLEF-2000 evaluation. Our method, based on combining the benefits of words and character n-grams, was effective for both language-in...
Cross-Language Information Retrieval for Technical Documents
This paper proposes a Japanese/English cross-language information retrieval (CLIR) system targeting technical documents. Our system first translates a given query containing technical terms into the target language, and then retrieves documents relevant to the translated query. The translation of technical terms is still problematic in that technical terms are often compound words, and thus new te...
Semantic annotation for concept-based cross-language medical information retrieval
We present a framework for concept-based cross-language information retrieval in the medical domain, which is under development in the MUCHMORE project. Our approach is based on using the Unified Medical Language System (UMLS) as the primary source of semantic data. Documents and queries are annotated with multiple layers of linguistic information. Linguistic processing includes part-of-speech ...
Publication date: 2002